A study of Oklahoma Education Graduation Rates

Prepared by Rafee Shaik, rafees@umich.edu
github repository: https://github.com/rafeesumich/Oklahoma-Education-Graduation-Rate-Analysis

Overview of the analysis:

This work is a study of Oklahoma state graduation rates by ethnic groups in the year 2017. You can get the yearly graduation Rate data from Oklahoma state education department website. Here is the link: https://sde.ok.gov/sde/sites/ok.gov.sde/files/documents/files/2017%20Graduation%20Rate.xlsx

Dataset: This dataset contains graduation rates at three levels.

  1. Overall state level graduation rate
  2. School level graduation rate, that gives graduation rate at HighSchoo level.
  3. District level graduation rate, that gives graduation rate from PK to 9th grade.

Procedure:

  1. Prepare the dataset: Split the data into three datasets,
    i. State level
    ii. Highschool level, grades 9 to 12th
    iii. District level, grades PK to 8th.
  2. Exploratory Data Analysis:
    i. State level graduation rate analysis, this data is studied at the Ethnicity level, and identify the most and least performing groups at state level.
    ii. Study the Highschool level graduation rates, identify which groups are performing least and most.
    iii. Study of District level graduation rates, how each ethnic group is performing, see if the same graduation rates are carried over to Highschool and State level rates.

Visualization Technique:

I'm going to use mainly three visualization techniques for this study: I'm going to do a univariate analysis, best visualization techniques for univariate analysis are Boxplots, Violinplots, and Histograms

  1. Box plots: Box plots are a good starting point for univariate numerical analysis, Which means it analyzes single set of data. It summarizes this data into Quartiles: Q1(25%), Q2(50%) and Q3(75%). And show us inter quartile ranges (IQR) in a box, outliers will also be shown. If needed it can also show us mean as well.
    When to use it: To get a quick summary of single variable.
    Limitations: Box plots cant show distribution of data within a single dataset, Consider violin plot for more detailed study of univariate analysis

  2. Violin plots: Violin plots are best to analyze single variate numerical data. Violin plots can show us distribution of underlying dataset through kernel density function. The distribution can help us identify different categories of data within a single variable. Violin plots will also show us outliers, IQR (Inter Quartile Range), and mean when needed. each side of the violinplot can be plotted using different categories of data within same dataset.
    Limitations: Use a histogram for full univariate analysis. Even though Violin plot has Kernel Density function builtin, the control over the visualization is limited.
    When to use: To look at the underlying data distribution within a single variable.

  3. Histogram: Histograms are another univariate analysis visualization technique. Histograms are best fit to understand the distribution of data. It divides the variable value range into equally sized bins and calculates the frequency of the variable within each bin. Histograms are great fit for probability distribution of continuous variable.

Other visualization techniques used in this analysis:

  1. Scatter plot: Good fit for multivariate analysis. We can use this plot to analyze the relationship between multiple variables, can be used to visualize clustering of data within a dataset.
  2. Bar chart : Use Bar Charts to compare values of different variables
  3. Maps: Great tool to plot data over a geographical area and compare, similar to heatmap, but data will be plotted on geographical regions.

Visualization Library:

Overview: Main visualization library used in this analyis is seaborn.
Seaborn is good fit for my analysis in this paper. It provides abstraction layer on top of matplotlib,It produces attractive and informative statistical graphics, has good integration with pandas library. Seaborn visualization library is created on top of Matplotlib. All the matplotlib kwargs will work with seaborn. Its a opensource, 'free to use' library.
Creater of this library: Michael Waskom
How to install it:
from anaconda prompt: conda install seaborn
from python terminal: pip install seaborn
Limitation: Lack of support for interactivity


Here are the three packages you need for this analysis those are not installed on coursera jupyter notebooks
pip install watermark
pip install geopandas
pip install descartes

Hardware and Software dependencies:

CPython 3.7.3 IPython 7.6.1 pandas 0.24.2 seaborn 0.9.0 matplotlib 3.1.0 numpy 1.16.4 geopandas 0.6.1 warnings unknown compiler : MSC v.1915 64 bit (AMD64) system : Windows release : 10 machine : AMD64 processor : Intel64 Family 6 Model 62 Stepping 4, GenuineIntel CPU cores : 32 interpreter: 64bit -------00Other environments test on:---------------- CPython 3.7.3 IPython 7.7.0 pandas 0.25.0 seaborn 0.9.0 matplotlib 3.1.2 numpy 1.17.0 geopandas 0.6.2 warnings unknown compiler : GCC 7.3.0 system : Linux release : 4.14.70-67.55.amzn1.x86_64 machine : x86_64 processor : x86_64 CPU cores : 8 interpreter: 64bit
In [6]:
%load_ext watermark
%watermark -v -m -p pandas,seaborn,matplotlib,numpy,geopandas,warnings
CPython 3.7.3
IPython 7.7.0

pandas 0.25.0
seaborn 0.9.0
matplotlib 3.1.2
numpy 1.17.0
geopandas not installed
warnings unknown

compiler   : GCC 7.3.0
system     : Linux
release    : 4.14.70-67.55.amzn1.x86_64
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit

Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks

Rule 1: Tell a story for an audience
    Provided a good overview of this analysis in the overview section
Rule 2: Document the process, not just the results
    I documented the entire process starting the download the data, cleaning it and analyzing it
Rule 3: Use cell divisions to make steps clear
    Prepared code in individula code blocks and placed them in single jupyter cell
Rule 4: Modularize code
    Code is modularized by analysis type, Here my analysis is divided into three parts
Rule 5: Record dependencies
    Dependencies are listed there using watermark package
Rule 6: Use version control
    Code and Data has been backed up and commited on Github
Rule 7: Build a pipeline
Rule 8: Share and explain your data
    Code and Data has been made public on Github
Rule 9: Design your notebooks to be read, run, and explored
    Organized the notebook into different sections so that it can be easily read, run and explored 
Rule 10: Advocate for open research
    There are sections in this notebook that gives opportunity for further reasearch and analysis
In [8]:
#Import required libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import geopandas as gpd

#Supress warnings
import warnings
warnings.filterwarnings('ignore')

Get required dataset from responsible organization

In [9]:
#Get the dataset
#df_orig=pd.read_excel('https://sde.ok.gov/sde/sites/ok.gov.sde/files/documents/files/2017%20Graduation%20Rate.xlsx')
#I saved a copy of this file in github, you can use it if you're not able to find it on sde.ok.gov website
df_orig=pd.read_excel('2017 Graduation Rate.xlsx')

About this dataset:

This dataset is created by state of Oklahoma every year. It contains county name, school district name,                  student graduation rate summerized by ethnicity at multiple levels of schooling: PK-to-8th grade, and 9th-to-12th.            We used most recent data that is avialble for public from state of oklahoma.                                              Elementry and Middle school gradualtion data is summerized at District level.

Prepare Dataset for Analysis 1: State level graduation rates

In [10]:
#Make a copy of original dataset.
df_state=df_orig.copy(deep=True)

#filter state level data only
df_state=df_state[df_state['EducationAgencyType']=='State']
#Data cleaning: 
#remove percentage symboles and other panctuations from numerical columns and convert them into Pandas numerical type
#We are also going to sort the data by value to easily identify the most and least performing ethnic groups.
df_state['FourYearGraduationRate']=pd.to_numeric(df_state['FourYearGraduationRate'].str.replace('%',''))
df_state = df_state.sort_values('FourYearGraduationRate',ascending=False).reset_index(drop=True)
In [11]:
#Take a look at first few records in the dataframe
df_state.head()
Out[11]:
CohortGraduationYear EducationAgencyType Countycode Districtcode SiteCode FullCode CountyName DistrictName SchoolName GradeLevelLow GradeLevelHigh ReportSubgroup FourYearGraduationRate
0 2017 State NaN NaN NaN NaN NaN NaN NaN PK 12.0 Asian 86
1 2017 State NaN NaN NaN NaN NaN NaN NaN PK 12.0 White 84
2 2017 State NaN NaN NaN NaN NaN NaN NaN PK 12.0 Regular Education 84
3 2017 State NaN NaN NaN NaN NaN NaN NaN PK 12.0 All 83
4 2017 State NaN NaN NaN NaN NaN NaN NaN PK 12.0 American Indian 83
In [12]:
#Prepare a bar chart of this data, on x-axis mark the ethnicity, and on y-axis plot the graduation rate of each ethinicity.

#Set the figure size to 12 by 8 inches, this is matplotlib property.
plt.figure(figsize=(12,8))

#Use seaborn barplot method to plot the Bar chart. Most of the seabon artist methods return axes that is same as in Matplotlib.
ax = sns.barplot(x=df_state['ReportSubgroup'],y=df_state['FourYearGraduationRate']
            ,order=df_state['ReportSubgroup']
           )
# Use the axes object to set the X axis label and Y labels and Figure title
ax.set(xlabel='Ethnicity',ylabel='Granduation Rate',title='State level graduation rate in different ethnic groups')
#Since we hive long names for each bar we need to rotate them to 90 degree so that we can read them properly
for item in ax.get_xticklabels(): item.set_rotation(90)
#Also I'm going to put bar height at the top of the bar. rotated to 45 degree, plot this height as text. color it in magenta
for i, v in enumerate(df_state['FourYearGraduationRate'].iteritems()):        
    ax.text(i ,v[1], "{:,}".format(v[1]), color='m', va ='bottom', rotation=45)
Observations from above chart:

From this graph you can clearly see Asians are most performing and English-Language-Learner are leat performing

Demonstration of Data Preperation

Prepare Dataset for Analysis 2: Highschool and Elemetery school level analysis

In [13]:
#Prepare School and District level ETHINIC based dataframes
#Select only required columns
columns = ['CohortGraduationYear', 'EducationAgencyType', 
       'Districtcode',  'CountyName', 'DistrictName',
        'GradeLevelLow', 'GradeLevelHigh', 'ReportSubgroup',
       'FourYearGraduationRate']
df=df_orig[columns]
#Remove % sign and other markers from numerical data
df['FourYearGraduationRate'] = (df['FourYearGraduationRate'].str.replace('>=','')
                                .str.replace('%','').replace('***',np.nan).str.replace('<',''))
#We're going to study the data at ethnic level so remove aggregated information rows 
#(marked by 'All' in 'ReportSubgroup' column) from data 
df=df[df['ReportSubgroup']!='All']
#Remove null value rows
df=df.dropna()
#convert Graduation data to panda's numeric data type for easy analysis.
df['FourYearGraduationRate']=pd.to_numeric(df['FourYearGraduationRate'])
#Seperate this dataset into two sets, one with HighSchool level data and One upto 8th

#Grades '9 and up' dataset - Highschools
df_school= df[df['EducationAgencyType']=='School']
#Grades 'PK to 8th' dataset
df_district= df[df['EducationAgencyType']=='District']
In [14]:
#df[df['EducationAgencyType']=='District'].head()
df_school.tail()
Out[14]:
CohortGraduationYear EducationAgencyType Districtcode CountyName DistrictName GradeLevelLow GradeLevelHigh ReportSubgroup FourYearGraduationRate
3947 2017 School I001 WOODWARD Woodward 09 12.0 Hispanic 85
3949 2017 School I001 WOODWARD Woodward 09 12.0 White 88
3950 2017 School I001 WOODWARD Woodward 09 12.0 Regular Education 90
3952 2017 School I002 WOODWARD Mooreland 09 12.0 Economically Disadvantaged 50
3957 2017 School I002 WOODWARD Mooreland 09 12.0 White 84

Demonstration 1 Visualization: Boxplot

In [15]:
#School level Analytics
#Set figure size to 16X8 inches
plt.figure(figsize=(16,8))
# On X axis we're going to put Ethinic groups, 
#Seaborn will automatically split the data into these groups and give us a category wise boxplots
#and we want to show mean point as well.
ax2= sns.boxplot(x='ReportSubgroup',y='FourYearGraduationRate', data=df_school, showmeans=True)

#
for item in ax2.get_xticklabels(): item.set_rotation(90)
ax2.set(xlabel='Ethnicity',ylabel='Granduation Rate')
#Observe how I'm setting the title here, earlier I set the title using axes object along with X and Y axis labels.
#Increase the title side and use color green.
plt.gca().set_title('2017 Highschool Granduation Rates in Oklahoma by ethinicity', fontsize=18,color='g')
plt.show()
Observation:

We can clearly see Graduation Rate changed at HighSchool level from state level, at Highschool level, 'Regular Education' group is performing better than 'Asian' group. English Language Learners are sill performing least.

Demonstration 1 Visualization: Violinplot

Lets explore how data is distributed within each group, Violin plots are best in this case.

In [16]:
plt.figure(figsize=(16,8))
ax3= sns.violinplot(x='ReportSubgroup',y='FourYearGraduationRate', data=df_school,showmeans=True)
for item in ax3.get_xticklabels(): item.set_rotation(90)
plt.gca().set_title('2017 Highschool Granduation Rates in Oklahoma by ethinicity', fontsize=18,color='g')
ax3.set(xlabel='Ethnicity',ylabel='Granduation Rate')
plt.show()

# Violinplots have similar syntax compared to boxplots. 
#but they will provide significant information within single visualization
# you can also use one side of the plot for one category and other side for another category with in the plot.
# use the 'hue' and 'split' parameters to split the data within the variable.
#Here is example syntax
#ax = sns.violinplot(x="day", y="total_bill", hue="smoker", data=tips, palette="muted", split=True)
Observation from above graph:

The mean of of all ethnic groups is around 80%, except in group 'English Language Learners'. Also the data is skewed toward the first quantile of the set. From this graph we can conclude Graduation Retes are significantly lower in 'English Language Learners'

Analysis 2: District level graduation rates

Objective: At school level analysis (above graph) we concluded that graduation rate in 'English Language Learners' is significanly lower when compared to other groups. We will prepare similar analysis and see if 'English Language Learner' group gradualtion rate is lowest at Elementry-Middle-school level as well.

In [17]:
#District level Analytics with boxplot
plt.figure(figsize=(16,8))
ax4= sns.boxplot(x='ReportSubgroup',y='FourYearGraduationRate', data=df_district, showmeans=True)
for item in ax4.get_xticklabels(): item.set_rotation(90)
ax4.set(xlabel='Ethnicity',ylabel='Granduation Rate')
plt.gca().set_title('2017 Primary and Midschool Granduation Rates in Oklahoma by ethinicity', fontsize=18,color='g')
plt.show()
In [18]:
# District level graduation rates with Violinplots
plt.figure(figsize=(16,8))
ax5= sns.violinplot(x='ReportSubgroup',y='FourYearGraduationRate', data=df_district,showmeans=True)
for item in ax5.get_xticklabels(): item.set_rotation(90)
plt.gca().set_title('2017 Primary and Midschool Granduation Rates in Oklahoma by ethinicity', fontsize=18,color='g')
ax5.set(xlabel='Ethnicity',ylabel='Granduation Rate')
plt.show()
Conclusion on Analysis 2:

Graduation rate in 'English Language Learner' is the lowest in both Highschool level and Elementry+midschool levels

Analysis 3: Overall Distribution of Graduation Rate

Irrespective of ethnicity lets see how graduation rates are distributed within Highschool level and Mid+Elementry School levels

Data Prepation

In [19]:
#Remove all the non-numerical charectars from numeric columns
df_all=df_orig[df_orig['ReportSubgroup']=='All']
df_all['FourYearGraduationRate']=df_all['FourYearGraduationRate'].str.replace('>','')
df_all['FourYearGraduationRate']=df_all['FourYearGraduationRate'].str.replace('%','')
df_all['FourYearGraduationRate']=df_all['FourYearGraduationRate'].str.replace('<','')
df_all['FourYearGraduationRate']=df_all['FourYearGraduationRate'].str.replace('=','')
df_all=df_all[df_all['FourYearGraduationRate']!='***']

### Divide this data into two datasets, one at HighSchool level and one for Elementry+Mid School level
#Prepare school level and District level data frames
df_all['FourYearGraduationRate']=pd.to_numeric(df_all['FourYearGraduationRate'])
#Grades '9 and up' dataset - Highschools
df_school_all= df_all[df_all['EducationAgencyType']=='School']
#Grades 'PK to 9th' dataset

df_district_all= df_all[df_all['EducationAgencyType']=='District']

Analysis 3: Visualization with Histogram

Distribution of Highschool graduation rates on Histogram

In [20]:
#Histograms in seaborn can be ploted using the function distplot. Histograms take single dimention data, 
#optionally you can provide bin size with parameter 'bins' 
#and ask to plot density estimation as well with parameter 'kde=True/False'
#you can also indicate whether you want to plot rugplot or not.

plt.figure(figsize=(14,8))
ax4=sns.distplot(df_school_all['FourYearGraduationRate']
                ,bins=30
                ,kde=True
                ,rug=False
                )
ax4.set(ylabel='Density',xlabel='Granduation Rate', title='Highshool level graduation rate distribution')

plt.show()
In [21]:
#Observe the rug plot option usage in this graph
plt.figure(figsize=(14,8))
ax5=sns.distplot(df_district_all['FourYearGraduationRate'], bins=30
                ,kde=True
                ,rug=True
                )
plt.show()

Analysis 4: Heatmap over Okalahoma school district weight by graduation rates

We will be using geopandas for this analysis to plot a heat map over okalahoma with School level graduation rates

In [23]:
#
#School districts Shape file
#We need Oklahoma county and school district level shape files for this analysis
fp_cousub=r'SchoolMaps/tl_2017_40_cousub.shp'
#Download it from github
#fp_cousub='tl_2017_40_cousub.shp'
#you will read the shape file using geopandas's read_file method.
map_cousub= gpd.read_file(fp_cousub)

#Downsize the school level graduation rates to district level
downsized_school_df = df_school_all[['DistrictName','FourYearGraduationRate']]
downsized_school_df=downsized_school_df.groupby('DistrictName').agg({'FourYearGraduationRate':np.mean})
downsized_school_df=downsized_school_df.reset_index()
#Make sure District names are matching, turn both names to upper case
downsized_school_df.columns = ['NAME', 'FourYearGraduationRate']
downsized_school_df['NAME'] = downsized_school_df['NAME'].str.upper()
map_cousub['NAME'] = map_cousub['NAME'].str.upper()
#Merge shapefile with school level graduation rate dataframe
map_GradScore = map_cousub.merge(downsized_school_df, how='left',on='NAME')
In [25]:
#Plot the Graduation Rate data over Oklahoma school district areas
fig, ax = plt.subplots(1, 1,figsize=(15,15))
map_GradScore.plot(column='FourYearGraduationRate', ax=ax
                   #,cmap='tab20'
                   ,edgecolor='dimgrey'
                   ,legend=True
                   ,legend_kwds={'label': "Graduation rate legend", 'orientation': "horizontal"})
ax.set_title('Oklahoma 2017 School district level graduation rate',fontsize=20)
ax.set_axis_off()
plt.show()
In [27]:
#Preparing a county level heatmap
fp_county = "CountyMaps/tl_2017_us_county.shp"
#I shared this file on github
map_county_df = gpd.read_file(fp_county)
#Get oklahoma county data
map_df_ok_county=map_county_df[map_county_df['STATEFP']=='40']
# check data type so we can see that this is not a normal dataframe, but a GEOdataframe
#map_df_ok_county.head()

#Further downsample the school graduation data to county level
downsized_county_df = df_school_all[['CountyName','FourYearGraduationRate']]
downsized_county_df=downsized_county_df.groupby('CountyName').agg({'FourYearGraduationRate':np.mean})
downsized_county_df=downsized_county_df.reset_index()

downsized_county_df.columns=['NAME', 'FourYearGraduationRate']
downsized_county_df['NAME']=downsized_county_df['NAME'].str.strip()
map_df_ok_county['NAME']=map_df_ok_county['NAME'].str.upper()
map_df_county_join = map_df_ok_county.merge(downsized_county_df,how='left',on='NAME')
In [28]:
fig, ax = plt.subplots(1, 1,figsize=(15,15))
map_df_county_join.plot(column='FourYearGraduationRate', ax=ax
                   #,cmap='tab20'
                   , edgecolor='dimgrey'
                   ,legend=True
                   ,legend_kwds={'label': "Graduation rate legend", 'orientation': "horizontal"})
ax.set_title('Oklahoma 2017 county level graduation rate',fontsize=20)
ax.set_axis_off()
plt.show()
In [29]:
#lets combine the graduation data with county population data 
#and see if there is any correlation exists between country population and graduation rates
In [30]:
#download the oklahoma population data from link below
#https://www.oklahoma-demographics.com/counties_by_population
#Since extenal URLs are taking furever to respond I downloaded this datafram to a csv and attached to this assignment
#population = pd.read_html('https://www.oklahoma-demographics.com/counties_by_population')[0]
population = pd.read_csv('population.csv',delimiter='|')
population=population.iloc[:77]
population['Population'] = pd.to_numeric(population['Population'])
population['County'] = population['County'].str.replace(' County','')
population=population.rename(columns={'County':'NAME'})

population['NAME']=population['NAME'].str.upper()
pop_grd_rate_df = population.merge(downsized_county_df, how='left', on ='NAME')
In [31]:
#Seaborn scatterplot dipecting Population and graduation rates at county level
sns.scatterplot( x='Population',y='FourYearGraduationRate',data=pop_grd_rate_df)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e44a5b8d0>
In [32]:
#Looks like there are three large counties that are causing our data to be skewed to the right, 
#so we're going to filter them out
sns.scatterplot( x='Population',y='FourYearGraduationRate',data=pop_grd_rate_df.iloc[3:])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e448b15f8>
Plot this scatterplot using Plotly express and show trendline.
In [33]:
import plotly.express as px
fig = px.scatter(pop_grd_rate_df.iloc[3:], x='Population',y='FourYearGraduationRate', trendline="ols")
fig.show()
Observation from above chart:

There seem to be a low level positive correlation between county population and graduation rate.

In [35]:
#national_graduationRate_df = pd.read_html('https://www.governing.com/gov-data/education-data/state-high-school-graduation-rates-by-race-ethnicity.html')[0]
national_graduationRate_df = pd.read_csv('national_graduationRate_df.csv',delimiter='|')
#URL is: 'https://www.governing.com/gov-data/education-data/state-high-school-graduation-rates-by-race-ethnicity.html'
#national_graduationRate_df.head()
In [36]:
national_graduationRate_df.plot(kind='box')
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e3f6ff0b8>

Further exploration opportunity

One can take the oklahoma graduation rate in different ethnic groups and compare it againest National level graduation data

In [ ]:
 
In [ ]: